Chapter 7 Chinese Text Processing

In this chapter, we will turn to the topic of Chinese text processing. In particular, we will discuss one of the most important issues in Chinese language processing, i.e., word segmentation.

When we discussed English part-of-speech tagging in Chapter 5, word tokenization was easy because word boundaries in English are, for the most part, clearly delimited by whitespace. Chinese, however, does not put whitespace between characters or words, which poses a serious problem for word tokenization.

We will look at the issues of word tokenization and introduce jiebaR, one of the most often used R libraries for Chinese word segmentation. We will also include several case studies on Chinese text processing.

Later, in Chapter 9, we will introduce another segmenter developed by the CKIP Group at Academia Sinica. The CKIP Tagger appears to be the state-of-the-art tagger for Taiwan Mandarin, and it comes with additional functionalities.

library(tidyverse)
library(tidytext)
library(quanteda)
library(stringr)
library(jiebaR)
library(readtext)

7.1 Chinese Word Segmenter jiebaR

7.1.1 Start

First, if you haven’t installed the library jiebaR, you may need to install it manually:

install.packages("jiebaR")
library("jiebaR")

This is the version used for this tutorial.

packageVersion("jiebaR")
[1] '0.11'

Now let us take a look at a quick example. Let us assume that in our corpus, we have collected only one text document, with only a short paragraph.

text <- "綠黨桃園市議員王浩宇爆料,指民眾黨不分區被提名人蔡壁如、黃瀞瑩,在昨(6)日才請辭是為領年終獎金。台灣民眾黨主席、台北市長柯文哲7日受訪時則說,都是按流程走,不要把人家想得這麼壞。"

There are two important steps in Chinese word segmentation:

  • Initialize a jiebar object using worker()
  • Tokenize the texts into words using the function segment() with the designated jiebar object created earlier
seg1 <- worker()
segment(text, jiebar = seg1)
 [1] "綠黨"     "桃園市"   "議員"     "王浩宇"   "爆料"     "指民眾"  
 [7] "黨"       "不"       "分區"     "被"       "提名"     "人"      
[13] "蔡壁如"   "黃"       "瀞"       "瑩"       "在昨"     "6"       
[19] "日"       "才"       "請辭"     "是"       "為領"     "年終獎金"
[25] "台灣民眾" "黨"       "主席"     "台北"     "市長"     "柯文"    
[31] "哲"       "7"        "日"       "受訪"     "時則"     "說"      
[37] "都"       "是"       "按"       "流程"     "走"       "不要"    
[43] "把"       "人家"     "想得"     "這麼"     "壞"      
class(seg1)
[1] "jiebar"  "segment" "jieba"  

To word-tokenize the document text, you first initialize a jiebar object (here, seg1) using worker(), and then pass it to segment() via the argument jiebar = seg1 to tokenize text into words.

7.1.2 Parameters Setting

There are many different parameters you can specify when you initialize the jiebar object. You can find more details in the documentation (?worker). Some of the important arguments include:

  • user = ...: This argument is to specify the path to a user-defined dictionary
  • stop_word = ...: This argument is to specify the path to a stopword list
  • symbol = FALSE: Whether to return symbols (the default is FALSE)
  • bylines = FALSE: Whether to return a list or not (crucial if you are using tidytext::unnest_tokens())

Exercise 7.1 In our earlier example, when we created the jiebar object named seg1, we did not specify any arguments for worker(). Can you tell what the default settings are for the parameters of worker()?

Please try to create worker() with different settings (e.g., symbol = T, bylines = T) and see how the tokenization results differ from each other.

7.1.3 User-defined dictionary

From the above example, it is clear that some of the words are not correctly identified by the current segmenter: for example, 民眾黨, 不分區, 黃瀞瑩, 柯文哲.

It is always recommended to include a user-defined dictionary when tokenizing your texts because different corpora may have their own unique vocabulary (i.e., domain-specific lexicon).

This can be done with the argument user = ... when you initialize the jiebar object, i.e., worker(..., user = ...).

seg2 <- worker(user = "demo_data/dict-ch-user-demo.txt")
segment(text, seg2)
 [1] "綠黨"     "桃園市"   "議員"     "王浩宇"   "爆料"     "指"      
 [7] "民眾黨"   "不分區"   "被"       "提名"     "人"       "蔡壁如"  
[13] "黃瀞瑩"   "在昨"     "6"        "日"       "才"       "請辭"    
[19] "是"       "為領"     "年終獎金" "台灣"     "民眾黨"   "主席"    
[25] "台北"     "市長"     "柯文哲"   "7"        "日"       "受訪"    
[31] "時則"     "說"       "都"       "是"       "按"       "流程"    
[37] "走"       "不要"     "把"       "人家"     "想得"     "這麼"    
[43] "壞"      

The user-defined dictionary is a plain text file with one word per line. The default encoding of the dictionary is UTF-8.

Please note that in Windows, the default encoding of a Chinese txt file created by Notepad may not be UTF-8. (Usually, it is encoded in Big5.)

Also, files created by MS Office applications tend to be less transparent in terms of their encoding.
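If you are unsure about the encoding of a dictionary file, one option is to re-encode it to UTF-8 in R before passing it to worker(). The following is a minimal sketch using only base R. The three words and both temporary files are made up for illustration, and the sketch assumes that iconv() on your platform recognizes the encoding name "BIG5".

```r
# Simulate a Big5-encoded dictionary file (one word per line);
# the words and the temporary file are for illustration only
words <- c("民眾黨", "不分區", "柯文哲")
big5_file <- tempfile(fileext = ".txt")
big5_bytes <- iconv(paste(words, collapse = "\n"),
                    from = "UTF-8", to = "BIG5", toRaw = TRUE)[[1]]
writeBin(big5_bytes, big5_file)

# Read the raw lines, then convert them from Big5 to UTF-8
dict_big5 <- readLines(big5_file, warn = FALSE)
dict_utf8 <- iconv(dict_big5, from = "BIG5", to = "UTF-8")

# Save the converted dictionary as a UTF-8 text file
utf8_file <- tempfile(fileext = ".txt")
writeLines(dict_utf8, utf8_file, useBytes = TRUE)

# The converted file could then serve as the user dictionary, e.g.:
# seg_demo <- worker(user = utf8_file)
```

If iconv() returns NA for some lines, the file is probably not in the encoding you assumed, and it is worth checking the source file again before using it as a dictionary.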

Creating a user-defined dictionary may take a lot of time. You may consult 搜狗詞庫, which includes many domain-specific dictionaries created by others.

However, it should be noted that the format of the dictionaries is .scel. You may need to convert the .scel to .txt before you use it in jiebaR.

To do the conversion automatically, please consult the library cidian.

You also need to convert between simplified and traditional Chinese. For this, you may consult the library ropencc in R.

7.1.4 Stopwords

When you initialize the worker(), you can also specify a stopword list, i.e., words that you do not need to include in the later analyses.

For example, in text mining, function words are usually less informative and are thus often excluded during preprocessing.

seg3 <- worker(user = "demo_data/dict-ch-user-demo.txt",
               stop_word = "demo_data/stopwords-ch-demo.txt")
segment(text, seg3)
 [1] "綠黨"     "桃園市"   "議員"     "王浩宇"   "爆料"     "指"      
 [7] "民眾黨"   "不分區"   "被"       "提名"     "人"       "蔡壁如"  
[13] "黃瀞瑩"   "在昨"     "6"        "才"       "請辭"     "為領"    
[19] "年終獎金" "台灣"     "民眾黨"   "主席"     "台北"     "市長"    
[25] "柯文哲"   "7"        "受訪"     "時則"     "說"       "按"      
[31] "流程"     "走"       "不要"     "把"       "人家"     "想得"    
[37] "這麼"     "壞"      
Exercise 7.2 How do we quickly check which words in segment(text, seg2) were removed as compared to the results of segment(text, seg3)? (Note: seg2 and seg3 only differ in the stop_word=... argument.)
[1] "日" "是" "都"

7.1.5 POS Tagging

So far we haven’t seen the parts-of-speech tags provided by the word segmenter. If you need the POS tags of the words, you need to specify the argument type = "tag" when you initialize the worker().

seg4 <- worker(type = "tag", 
               user = "demo_data/dict-ch-user-demo.txt", 
               stop_word = "demo_data/stopwords-ch-demo.txt",
               symbol = T)
segment(text, seg4)
         n         ns          n          x          n          x          n 
    "綠黨"   "桃園市"     "議員"   "王浩宇"     "爆料"       ","       "指" 
         n          n          p          v          n          x          x 
  "民眾黨"   "不分區"       "被"     "提名"       "人"   "蔡壁如"       "、" 
         n          x          x          x          x          x          d 
  "黃瀞瑩"       ","     "在昨"       "("        "6"       ")"       "才" 
         v          x          n          x          x          n          n 
    "請辭"     "為領" "年終獎金"       "。"     "台灣"   "民眾黨"     "主席" 
         x         ns          n          n          x          v          x 
      "、"     "台北"     "市長"   "柯文哲"        "7"     "受訪"     "時則" 
        zg          x          p          n          v          x         df 
      "說"       ","       "按"     "流程"       "走"       ","     "不要" 
         p          n          x          r          a          x 
      "把"     "人家"     "想得"     "這麼"       "壞"       "。" 

The returned object is a named character vector, i.e., the POS tags of the words are stored as the names of the vector.
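Because the result is an ordinary named character vector, the words and their tags can be pulled apart with base R functions. The sketch below uses a short hand-typed vector, copied from the first few tokens of the output above, as a stand-in for the result of segment(text, seg4). The object names tagged and tagged_df are made up for this illustration.

```r
# A hand-typed stand-in for the first few tokens of segment(text, seg4)
tagged <- c(n = "綠黨", ns = "桃園市", n = "議員", x = "王浩宇", n = "爆料")

# The POS tags are the names; the words are the values
names(tagged)
unname(tagged)

# Arrange words and tags side by side in a data frame
tagged_df <- data.frame(word = unname(tagged),
                        tag = names(tagged),
                        stringsAsFactors = FALSE)
tagged_df

# Keep only the tokens tagged as nouns (n)
tagged[names(tagged) == "n"]
```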


Every POS tagger has its own predefined tagset. The following table lists the annotations of the POS tagset used in jiebaR:

Exercise 7.3 How do we convert the named word vector with POS tags returned by segment(text, seg4) into a long string as shown below?
[1] "綠黨/n 桃園市/ns 議員/n 王浩宇/x 爆料/n ,/x 指/n 民眾黨/n 不分區/n 被/p 提名/v 人/n 蔡壁如/x 、/x 黃瀞瑩/n ,/x 在昨/x (/x 6/x )/x 才/d 請辭/v 為領/x 年終獎金/n 。/x 台灣/x 民眾黨/n 主席/n 、/x 台北/ns 市長/n 柯文哲/n 7/x 受訪/v 時則/x 說/zg ,/x 按/p 流程/n 走/v ,/x 不要/df 把/p 人家/n 想得/x 這麼/r 壞/a 。/x"

7.1.6 Default Word Lists in jiebaR

You can check the dictionaries and the stopword list being used by jiebaR in your current environment:

# show files under `dictpath`
dir(show_dictpath())

# Check the default stop_words list
# `STOPPATH` is exported by jiebaR; it is the default value of `stop_word` in worker()
readLines(STOPPATH, n = 100, encoding = "UTF-8")

7.1.7 Reminders

When we use segment() as the tokenization method in unnest_tokens(), it is very important to specify bylines = TRUE in worker().

This setting would make sure that segment() takes a text-based vector as input and returns a list of word-based vectors of the same length as output.

NB: When bylines = FALSE, segment() returns a vector.

seg_byline_1 <- worker(bylines = T)
seg_byline_0 <- worker(bylines = F)

(text_tag_1 <- segment(text, seg_byline_1))
[[1]]
 [1] "綠黨"     "桃園市"   "議員"     "王浩宇"   "爆料"     "指民眾"  
 [7] "黨"       "不"       "分區"     "被"       "提名"     "人"      
[13] "蔡壁如"   "黃"       "瀞"       "瑩"       "在昨"     "6"       
[19] "日"       "才"       "請辭"     "是"       "為領"     "年終獎金"
[25] "台灣民眾" "黨"       "主席"     "台北"     "市長"     "柯文"    
[31] "哲"       "7"        "日"       "受訪"     "時則"     "說"      
[37] "都"       "是"       "按"       "流程"     "走"       "不要"    
[43] "把"       "人家"     "想得"     "這麼"     "壞"      
(text_tag_0 <- segment(text, seg_byline_0))
 [1] "綠黨"     "桃園市"   "議員"     "王浩宇"   "爆料"     "指民眾"  
 [7] "黨"       "不"       "分區"     "被"       "提名"     "人"      
[13] "蔡壁如"   "黃"       "瀞"       "瑩"       "在昨"     "6"       
[19] "日"       "才"       "請辭"     "是"       "為領"     "年終獎金"
[25] "台灣民眾" "黨"       "主席"     "台北"     "市長"     "柯文"    
[31] "哲"       "7"        "日"       "受訪"     "時則"     "說"      
[37] "都"       "是"       "按"       "流程"     "走"       "不要"    
[43] "把"       "人家"     "想得"     "這麼"     "壞"      
class(text_tag_1)
[1] "list"
class(text_tag_0)
[1] "character"

7.2 Chinese Text Analytics Pipeline

In Chapter 5, we talked about the pipeline for English text processing, as shown below:


Figure 7.1: English Text Analytics Flowchart


For Chinese texts, the pipeline is similar.

In the following Chinese Text Analytics Flowchart (Figure 7.2), I have highlighted the steps that are crucial to Chinese processing.

  • It is not recommended to use quanteda::summary() and quanteda::kwic() directly on the Chinese corpus object because the word tokenization in quanteda is not ideal (cf. dashed arrows in Figure 7.2).
  • It is recommended to use a self-defined word segmenter for analysis. For processing under the tidy structure framework, use your own segmenter in unnest_tokens(); for processing under the quanteda framework, use your own segmenter to create the tokens object, which is a class defined in quanteda.

Figure 7.2: Chinese Text Analytics Flowchart


It is important to note that when we supply a self-defined function to unnest_tokens(…, token = …), this function should take a character vector (i.e., a text-based vector) and return a list of character vectors (i.e., word-based vectors) of the same length.

In other words, when initializing the Chinese word segmenter, we need to specify the argument bylines = TRUE, i.e., worker(…, bylines = TRUE).
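This input/output requirement can be checked before a tokenizer is passed to unnest_tokens(). The sketch below uses a toy character-level tokenizer built on strsplit() as a stand-in for segment(). The function name toy_tokenizer and the two short texts are made up for illustration and are not part of jiebaR or tidytext.

```r
# A toy tokenizer (hypothetical stand-in for a jiebaR-based segmenter):
# it simply splits each text into single characters
toy_tokenizer <- function(x) {
  strsplit(x, split = "")
}

texts <- c("柯文哲受訪", "按流程走")
tokens_list <- toy_tokenizer(texts)

# The output is a list ...
class(tokens_list)

# ... with exactly one character vector per input text
length(tokens_list) == length(texts)
```

A segmenter initialized with worker(bylines = TRUE) is expected to pass the same check, whereas one initialized with bylines = FALSE returns a single character vector rather than a list.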


7.2.1 Creating a Corpus Object

Based on our simple corpus example above, we first transform the character vector text into a corpus object, text_corpus. As with the English data, we can then apply summary() to the corpus object and kwic() to the tokens object created from it.

## create corpus object
text_corpus <- text %>% 
  corpus

## summary
summary(text_corpus)
## Create tokens object
text_tokens <- tokens(text_corpus)

## KWIC
kwic(text_tokens, pattern = "柯文哲")
kwic(text_tokens, pattern = "柯")
Exercise 7.4 Do you know why no concordance lines are returned by kwic(text_tokens, pattern = "柯文哲")?

7.2.2 Tidy Structure Framework

We can transform the corpus object into a text-based TIBBLE using tidy(). Also, we generate a unique index for each row/text using row_number().

# a text-based tidy corpus
text_corpus_tidy <-text_corpus %>%
  tidy %>% 
  mutate(textID = row_number())

text_corpus_tidy
  • For word segmentation, we initialize the jiebar object using worker().
# initialize segmenter
my_seg <- worker(bylines = T, 
                 user = "demo_data/dict-ch-user-demo.txt", 
                 symbol=T)
  • Finally, we use unnest_tokens() to tokenize the text-based TIBBLE text_corpus_tidy into a word-based TIBBLE text_corpus_tidy_word. That is, texts included in the text column are tokenized into words, which are unnested into rows of the word column in the new TIBBLE.
# tokenization
text_corpus_tidy_word <- text_corpus_tidy %>%
  unnest_tokens(
    word,                 ## new tokens
    text,                 ## original larger units
    token = function(x)   ## self-defined tokenization method
      segment(x, jiebar = my_seg)
  )

text_corpus_tidy_word

It can be seen that for the token parameter in unnest_tokens(), we use an anonymous function, built on the jiebar object my_seg and segment(), for self-defined Chinese word segmentation.

It is called an anonymous function because it has not been assigned to any object name in the current R session.

You may check R language documentation for more detail on Writing Functions.
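The difference between a named and an anonymous function can be seen in a small base R sketch. The function name count_chars and the two example strings are made up for illustration.

```r
texts <- c("Taipei", "Taoyuan City")

# Option 1: a named function, assigned to an object before it is used
count_chars <- function(x) nchar(x)
res_named <- sapply(texts, count_chars)

# Option 2: an anonymous function, defined right where it is used
res_anon <- sapply(texts, function(x) nchar(x))

# Both give the same result
identical(res_named, res_anon)
```

An anonymous function is convenient when the function is short and needed only once, as is the case with the token argument above.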

7.2.3 Quanteda Framework

  • Under the quanteda framework, we can also create the tokens object of the corpus and do kwic() search.
  • Most of the functions that work with a corpus object can also work with a tokens object in quanteda.
## create tokens based on self-defined segmentation
text_tokens <- text_corpus_tidy$text %>%
  segment(jiebar = my_seg) %>%
  as.tokens 

## kwic on word tokens
kwic(text_tokens,
     pattern = "柯文哲")
kwic(text_tokens,
     pattern = ".*?[台市].*?", valuetype = "regex")

7.3 Comparing Tokenization Methods

quanteda also provides its own default word tokenization for Chinese texts. However, its default tokenization method does not allow us to add our own dictionary to the segmentation process, which renders the results less reliable.

We can compare the two results.

  • We can use quanteda::tokens() to see how quanteda tokenizes Chinese texts. The function returns a tokens object.
# create TOKENS object using quanteda default
text_corpus %>%
  tokens -> text_tokens_qtd
  • We can also use our own tokenization function segment() and convert the resulting list to a tokens object using as.tokens(). (This of course gives us the same tokenization result as the earlier unnest_tokens() because we are using the same segmenter, my_seg.)
# create TOKENS object manually
text_corpus %>% 
  segment(jiebar = my_seg) %>% 
  as.tokens -> text_tokens_jb

Now let’s compare the two resulting tokens objects:

  • These are the tokens based on self-defined segmenter:
# compare our tokenization with quanteda tokenization
text_tokens_jb[[1]]
 [1] "綠黨"     "桃園市"   "議員"     "王浩宇"   "爆料"     ","      
 [7] "指"       "民眾黨"   "不分區"   "被"       "提名"     "人"      
[13] "蔡壁如"   "、"       "黃瀞瑩"   ","       "在昨"     "("      
[19] "6"        ")"       "日"       "才"       "請辭"     "是"      
[25] "為領"     "年終獎金" "。"       "台灣"     "民眾黨"   "主席"    
[31] "、"       "台北"     "市長"     "柯文哲"   "7"        "日"      
[37] "受訪"     "時則"     "說"       ","       "都"       "是"      
[43] "按"       "流程"     "走"       ","       "不要"     "把"      
[49] "人家"     "想得"     "這麼"     "壞"       "。"      
  • These are the tokens based on default quanteda tokenizer:
text_tokens_qtd[[1]]
 [1] "綠黨"     "桃園市"   "議員"     "王"       "浩"       "宇"      
 [7] "爆"       "料"       ","       "指"       "民眾"     "黨"      
[13] "不"       "分區"     "被"       "提名"     "人"       "蔡"      
[19] "壁"       "如"       "、"       "黃"       "瀞"       "瑩"      
[25] ","       "在"       "昨"       "("       "6"        ")"      
[31] "日"       "才"       "請辭"     "是"       "為"       "領"      
[37] "年終獎金" "。"       "台灣"     "民眾"     "黨主席"   "、"      
[43] "台北市"   "長"       "柯"       "文"       "哲"       "7"       
[49] "日"       "受"       "訪"       "時"       "則"       "說"      
[55] ","       "都是"     "按"       "流程"     "走"       ","      
[61] "不要"     "把"       "人家"     "想得"     "這麼"     "壞"      
[67] "。"      

Therefore, for linguistic analysis, the recommended method is to define your own Chinese word segmenter using jiebaR, tailored to the specific task or corpus. In addition, you can define a user dictionary for domain-specific texts/corpora.

7.4 Data

In the following sections, we look at a few more case studies of Chinese text processing using the news articles collected from Apple News as our example corpus. The dataset is available in our course dropbox drive: demo_data/applenews10000.tar.gz.

This data set was collected by Meng-Chen Wu when he was working on his MA thesis project with me years ago. The demo data here was a random sample of the original Apple News Corpus.

7.5 Loading Text Data

When we need to load text data from external files (e.g., txt, tar.gz files), there is a simple and powerful R package for loading texts: readtext. The main function in this package, readtext(), takes a file or a directory name from the disk or a URL, and returns a type of data.frame that can be used directly with the corpus() constructor function in quanteda to create a quanteda corpus object.

In other words, the output from readtext is a data frame, which can be directly passed on to the processing in the tidy structure framework (i.e., tidytext::unnest_tokens()).

The function readtext() works on:

  • text (.txt) files;
  • comma-separated-value (.csv) files;
  • XML formatted data;
  • data from the Facebook API, in JSON format;
  • data from the Twitter API, in JSON format; and
  • generic JSON data.

The corpus constructor command corpus() works directly on:

  • a vector of character objects, for instance that you have already loaded into the workspace using other tools;
  • a data.frame containing a text column and any other document-level metadata
  • the output of readtext::readtext()
# loading the corpus
# NB: this may take some time

## create `corpus`
apple_corpus <- corpus(readtext("demo_data/applenews10000.tar.gz"))

## summary corpus
summary(apple_corpus, 10)
## create `tokens`
apple_tokens_QTD <- tokens(apple_corpus)

7.6 quanteda::tokens() vs. jiebaR::segment()

In Chapter 4, we’ve seen that after we create a corpus object, we can convert the corpus into tokens and apply kwic() to get the concordance lines of a particular word/expression.

We can do the same with Chinese texts as well:

kwic(apple_tokens_QTD, "勝率")

As mentioned in Section 7.3, quanteda applies its built-in tokenization method (i.e., tokens()) for Chinese word segmentation.

It uses the tokenizer stringi::stri_split_boundaries(), which relies on the ICU library (International Components for Unicode); ICU in turn uses dictionaries to segment Chinese texts. The biggest problem with this built-in method is that we cannot add our own dictionary when using the default tokens() (at least, I do not know of a way to do so).
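To get a feel for what ICU-based segmentation does on its own, we can call the stringi function directly. This is only a rough sketch: it assumes the stringi package is installed, the short string text_demo is an excerpt made up from our earlier example, and the result need not be identical to the output of quanteda::tokens(), which does additional processing of its own.

```r
library(stringi)

# A short excerpt from the earlier example text (for illustration only)
text_demo <- "台灣民眾黨主席、台北市長柯文哲7日受訪"

# Split at ICU word boundaries; skip_word_none = TRUE drops
# pieces that are not words (e.g., punctuation and spaces)
icu_words <- stri_split_boundaries(text_demo,
                                   type = "word",
                                   skip_word_none = TRUE)

# One character vector of word candidates per input string
icu_words[[1]]
```

Note that there is no argument here for supplying a user dictionary, which is the limitation discussed above.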

As we did in Section 7.3, we can compare the word segmentation results of the quanteda defaults and jiebaR (with our own dictionary) on our current news corpus.

To create tokens using jiebaR:

  • First, we initialize the jiebar object with our self-defined dictionary;
  • Second, we use it to tokenize all texts in apple_corpus along with jiebaR::segment()
  • Finally, we convert the returned list from jiebaR::segment() into a quanteda tokens object using as.tokens().
# Initialize the segmenter
segmenter <- worker(user="demo_data/dict-ch-user.txt", 
                    bylines = T, 
                    symbol = T)

# Tokenization using jiebaR
apple_corpus %>% 
  segment(jiebar = segmenter) %>%
  as.tokens -> apple_tokens_JB

Now we can compare the two versions of the tokens. Let’s take a look at the first document:

apple_tokens_JB[[1]] %>% length
[1] 168
apple_tokens_QTD[[1]] %>% length
[1] 148
apple_tokens_JB[[1]] %>% as.character
  [1] "《"     "蘋果"   "體育"   "》"     "即日起" "進行"   "虛擬"   "賭盤"  
  [9] "擂台"   ","     "每名"   "受邀"   "參賽者" "進行"   "勝負"   "預測"  
 [17] ","     "每周"   "結算"   "在"     "周二"   "公布"   ","     "累積"  
 [25] "勝率"   "前"     "3"      "高"     "參賽者" "可"     "繼續"   "參賽"  
 [33] ","     "單周"   "勝率"   "最高者" ","     "將"     "加封"   "「"    
 [41] "蘋果"   "波神"   "」"     "頭銜"   "。"     "註"     ":"      "賭盤"  
 [49] "賠率"   "如有"   "變動"   ","     "以"     "台灣"   "運彩"   "為主"  
 [57] "。"     "\n"     "資料"   "來源"   ":"     "NBA"    "官網"   "http"  
 [65] ":"      "/"      "/"      "www"    "."      "nba"    "."      "com"   
 [73] "\n"     "\n"     "金塊"   "("      "客"     ")"      " "      "103"   
 [81] ":"     "92"     " "      "76"     "人"     "騎士"   "("      "主"    
 [89] ")"      " "      "88"     ":"     "82"     " "      "快艇"   "活塞"  
 [97] "("      "客"     ")"      " "      "92"     ":"     "75"     " "     
[105] "公牛"   "勇士"   "("      "客"     ")"      " "      "108"    ":"    
[113] "82"     " "      "灰熊"   "熱火"   "("      "客"     ")"      " "     
[121] "103"    ":"     "82"     " "      "灰狼"   "籃網"   "("      "客"    
[129] ")"      " "      "90"     ":"     "82"     " "      "公鹿"   "溜"    
[137] "馬"     "("      "客"     ")"      " "      "111"    ":"     "100"   
[145] " "      "馬刺"   "國王"   "("      "客"     ")"      " "      "112"   
[153] ":"     "102"    " "      "爵士"   "小牛"   "("      "客"     ")"     
[161] " "      "108"    ":"     "106"    " "      "拓荒者" "\n"     "\n"    
apple_tokens_QTD[[1]] %>% as.character 
  [1] "《"                 "蘋果"               "體育"              
  [4] "》"                 "即日起"             "進行"              
  [7] "虛擬"               "賭"                 "盤"                
 [10] "擂台"               ","                 "每名"              
 [13] "受邀"               "參賽者"             "進行"              
 [16] "勝負"               "預測"               ","                
 [19] "每周"               "結算"               "在"                
 [22] "周二"               "公布"               ","                
 [25] "累積"               "勝率"               "前"                
 [28] "3"                  "高"                 "參賽者"            
 [31] "可"                 "繼續"               "參賽"              
 [34] ","                 "單"                 "周"                
 [37] "勝率"               "最高"               "者"                
 [40] ","                 "將"                 "加封"              
 [43] "「"                 "蘋果"               "波"                
 [46] "神"                 "」"                 "頭銜"              
 [49] "。"                 "註"                 ":"                 
 [52] "賭"                 "盤"                 "賠"                
 [55] "率"                 "如有"               "變動"              
 [58] ","                 "以"                 "台灣"              
 [61] "運"                 "彩"                 "為主"              
 [64] "。"                 "資料"               "來源"              
 [67] ":"                 "NBA"                "官"                
 [70] "網"                 "http://www.nba.com" "金塊"              
 [73] "("                  "客"                 ")"                 
 [76] "103"                ":"                 "92"                
 [79] "76"                 "人"                 "騎士"              
 [82] "("                  "主"                 ")"                 
 [85] "88"                 ":"                 "82"                
 [88] "快艇"               "活塞"               "("                 
 [91] "客"                 ")"                  "92"                
 [94] ":"                 "75"                 "公牛"              
 [97] "勇士"               "("                  "客"                
[100] ")"                  "108"                ":"                
[103] "82"                 "灰"                 "熊"                
[106] "熱火"               "("                  "客"                
[109] ")"                  "103"                ":"                
[112] "82"                 "灰"                 "狼"                
[115] "籃網"               "("                  "客"                
[118] ")"                  "90"                 ":"                
[121] "82"                 "公鹿"               "溜"                
[124] "馬"                 "("                  "客"                
[127] ")"                  "111"                ":"                
[130] "100"                "馬"                 "刺"                
[133] "國王"               "("                  "客"                
[136] ")"                  "112"                ":"                
[139] "102"                "爵士"               "小牛"              
[142] "("                  "客"                 ")"                 
[145] "108"                ":"                 "106"               
[148] "拓荒者"            
kwic(apple_tokens_JB, "勝率", window = 10)
kwic(apple_tokens_QTD, "勝率", window = 10)

Any significant differences in the word tokenization?


When working with Chinese texts, if you need the more advanced text-analytic functions provided by quanteda, I would suggest first tokenizing the texts with your own word segmenter and then converting the result into a tokens object, which can be properly passed on to other functions in quanteda (e.g., kwic, dfm).

In the later demonstrations, we will use our own defined segmenter based on jiebaR for Chinese word segmentation.

7.7 Case Study 1: Word Frequency and Wordcloud

We follow the same steps as illustrated in the flowchart in Figure 7.2 and process the Chinese texts based on the tidy structure framework:

  • Load the corpus data using readtext() and create a text-based data frame of the corpus;
  • Initialize a jiebar word segmenter using worker()
  • Tokenize the text-based data frame into a word-based tidy data frame using unnest_tokens()

Please note that the output of readtext() is already a data frame, i.e., a tidy structure of the corpus. We did not create the quanteda corpus object because in this example we did not need it for further processing with other quanteda functions (e.g., kwic()).

# loading corpus
apple_df <- readtext("demo_data/applenews10000.tar.gz") %>%
  filter(text != "") %>% #remove empty documents
  mutate(filename = doc_id, ## save original text filenames
         doc_id = row_number()) %>% # revise/create new document index
  data.frame()
# Initialize the `jiebar`
segmenter_word <- worker(user="demo_data/dict-ch-user.txt", 
                    bylines = T, 
                    symbol = T)
# Tokenization: Word-based DF
apple_word <- apple_df %>%
  unnest_tokens(
    output = word,
    input = text,
    token = function(x)
      segment(x, jiebar = segmenter_word)
  ) %>%
  group_by(doc_id) %>%
  mutate(word_id = row_number()) %>% # create word index within each document
  ungroup
apple_word %>% head(100)

These tokenization results should be the same as our earlier apple_tokens_JB:

apple_word %>%
  filter(doc_id == 1) %>%
  mutate(word_2 = apple_tokens_JB[[1]])

Creating unique indices for your data is very important. In corpus linguistic analysis, we often need to keep track of the original context of the word, phrase or sentence in the concordances. All these unique indices (as well as the source text filenames) would make things a lot easier.

Also, if the metadata of the source documents are available, these unique indices would allow us to connect the tokenized linguistic units to the metadata information (e.g., genres, registers, author profiles) for more interesting analysis.
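As a small illustration of why these indices matter, the sketch below joins a word-based data frame to document-level metadata through a shared doc_id. Both data frames, including the genre column and its values, are made up for this example; our demo corpus does not come with such metadata.

```r
# Hypothetical word-based data frame (two documents)
word_df <- data.frame(
  doc_id  = c(1, 1, 1, 2, 2),
  word_id = c(1, 2, 3, 1, 2),
  word    = c("蘋果", "體育", "賭盤", "新北市", "男子"),
  stringsAsFactors = FALSE
)

# Hypothetical document-level metadata
meta_df <- data.frame(
  doc_id = c(1, 2),
  genre  = c("sports", "local"),
  stringsAsFactors = FALSE
)

# Attach the metadata to every word through the shared index
word_meta <- merge(word_df, meta_df, by = "doc_id")
word_meta
```

Within the tidy framework, dplyr::left_join(word_df, meta_df, by = "doc_id") would do the same job.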


With a word-based data frame, we can easily generate a word frequency list as well as a word cloud to have a quick overview of the word distribution of the corpus.

stopwords_chi <- readLines("demo_data/stopwords-ch.txt")
apple_word_freq <- apple_word %>%
  filter(!word %in% stopwords_chi) %>% # remove stopwords
  filter(word %>% str_detect(pattern = "\\D+")) %>% # keep words with at least one non-digit character (i.e., remove pure numbers)
  count(word) %>%
  arrange(desc(n))

library(wordcloud2)
apple_word_freq %>%
  filter(n > 400) %>%
  filter(nchar(word) >= 2) %>%
  wordcloud2(shape = "star", size = 0.3)

7.8 Case Study 2: Patterns

In this case study, we are looking at a more complex example. In corpus linguistic analysis, we often need to extract a particular pattern from the texts. In order to retrieve the target patterns at a high accuracy rate, we often need to make use of the additional annotations provided by the corpus.

The most often used information is the part-of-speech tags of words. So here we demonstrate how to add POS tag information to our current tidy corpus design.

Our steps are as follows:

  1. Initialize a jiebar object, which performs not only word segmentation but also POS tagging;
  2. Create a self-defined function to word-seg and pos-tag each text and combine all tokens, word/tag, into a long string for each text;
  3. With the text-based apple_df, create a new column, which includes the enriched version of each text, using mutate()
# initialize `jiebar`
segmenter_word_pos <- worker(
  type = "tag",
  # get pos
  user = "demo_data/dict-ch-user.txt",
  # use own dict
  symbol = T,
  # keep symbols
  bylines = FALSE
)

# define a function to word-seg and pos-tag a text
tag_text <- function(x, jiebar) {
  segment(x, jiebar) %>%
    paste(names(.), sep = "/", collapse = " ")
}
# demo of the function `tag_text()`
tag_text(apple_df$text[2], segmenter_word_pos)
[1] "【/x 動/v 新聞/n ╱/x 綜合/vn 報導/n 】/x 新北市/x 一名/m 18/m 歲/zg 李姓/x 男子/n ,/x 疑因/x 女友/n 要/v 分手/v ,/x 避不見面/i 也/d 不/d 接電話/l ,/x 他/r 為/zg 見/v 女友/n 一面/m ,/x 挽回/v 感情/n ,/x 竟學/x 蜘蛛人/n 攀爬/v 鐵窗/n ,/x 欲/d 潛入/v 女友/n 位於/v 5/x 樓/n 住處/n ,/x 行經/n 2/x 樓時/x 被/p 住戶/n 發現/v ,/x 他/r 誆稱/x 「/x 撿/v 鑰匙/n 」/x 矇混過關/l ,/x 但/c 仍/zg 被/p 4/x 樓/n 住戶/n 懷疑/v 是/v 小偷/d 報案/n ,/x 最後/x 雖/zg 成功/x 進入/v 5/x 樓/n 見到/v 女友/n ,/x 仍/zg 無法挽回/i 感情/n ,/x 因/p 侵入/v 住宅/n 罪嫌/n 被/p 帶回/v 警局/x ,/x 女方/n 家屬/n 不/d 提告/x 作罷/v 。/x 警方/n 指出/v ,/x 李男為/x 挽回/v 感情/n ,/x 鋌而走險/x 攀爬/v 鐵窗/n 至/p 5/x 樓/n ,/x 其後/x 方/n 就是/d 一處/m 工地/n ,/x 萬一/x 失足/v 墜落/v ,/x 後果/n 不堪設想/i ,/x 經/zg 勸說/v 後/f ,/x 讓/v 李/nr 男/n 離去/v 。/x \n/x  /x"
# apply `tag_text()` function to each text
system.time(apple_df %>%
              mutate(text_tag = map_chr(text, tag_text, segmenter_word_pos)) -> apple_df)
   user  system elapsed 
  4.949   0.085   5.038 
apple_df %>% head

7.8.1 BEI Construction

This section will show you how we can make use of the POS tags for construction analysis. Let’s look at the example of 被 + ... Construction.

The data retrieval procedure is now very straightforward: we only need to create a regular expression that matches our construction and go through the enriched version of the texts (i.e., text_tag column in apple_df) to identify these matches with unnest_tokens().

  1. Define a regular expression \\b被/p\\s([^/]+/[^\\s]+\\s)*?[^/]+/v for the BEI construction, i.e., 被 + VERB;
  2. Use unnest_tokens() and str_extract_all() to extract target patterns and create a pattern-based data frame.

# define regex patterns
pattern_bei <- "\\b被/p\\s([^/]+/[^\\s]+\\s)*?[^/]+/v"

# extract patterns from corpus
apple_df %>%
  select(-text) %>% # `text` is the column with original raw texts
  unnest_tokens(
    output = pat_bei,
    ## pattern name
    input = text_tag,
    ## original base linguistic unit
    token = function(x)
      str_extract_all(x, pattern = pattern_bei)
  ) -> result_bei

result_bei
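
Before running a pattern over the whole corpus, it can help to try it on one short tagged string. The sketch below applies the same regular expression to a fragment copied from the tagged text shown earlier. The object names pattern_demo, tagged_demo, and matches_demo are made up for this illustration, and the sketch assumes that stringr is installed.

```r
library(stringr)

# Same regular expression as `pattern_bei`
pattern_demo <- "\\b被/p\\s([^/]+/[^\\s]+\\s)*?[^/]+/v"

# A fragment copied from the tagged text shown earlier
tagged_demo <- "行經/n 2/x 樓時/x 被/p 住戶/n 發現/v ,/x 他/r 誆稱/x"

# Extract every match; the result is a list with one element per input string
matches_demo <- str_extract_all(tagged_demo, pattern = pattern_demo)
matches_demo[[1]]
```

On this fragment, the pattern should pick out the span from 被/p up to 發現/v. Because the middle part of the pattern is lazy (*?), the match stops at the first word tagged with /v after 被.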

Please check Chapter 5 Parts of Speech Tagging on evaluating the quality of the data retrieved by a regular expression (i.e., precision and recall).


To have a more in-depth analysis of BEI construction, we can automatically identify the verb of the BEI construction.

# Extract BEI + WORD
result_bei <- result_bei %>%
  mutate(VERB = str_replace(pat_bei, ".+\\s([^/]+)/v$", "\\1"))

result_bei
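
The verb extraction step can be checked in the same way on a few matches. This is a small sketch with made-up input: toy_matches and toy_verbs are hypothetical names, and the strings imitate the values in the pat_bei column.

```r
library(stringr)

# Hypothetical construction tokens in the format of the `pat_bei` column
toy_matches <- c("被/p 住戶/n 發現/v", "被/p 帶回/v")

# `.+\\s` consumes everything up to the last whitespace;
# the capture group keeps the word form of the final verb token
toy_verbs <- str_replace(toy_matches, ".+\\s([^/]+)/v$", "\\1")
toy_verbs
# "發現" "帶回"
```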
## Exploratory Analysis
result_bei %>%
  count(VERB) %>%
  top_n(40, n) %>%
  ggplot(aes(x=reorder(VERB, n), y =n, fill=n)) +
  geom_bar(stat="identity") +
  coord_flip() +
  labs(x = "Verbs in BEI Constructions", y = "Frequency")

# Calculate WORD frequency
require(wordcloud2)
result_bei %>%
  count(VERB) %>%
  mutate(n = log(n)) %>%
  top_n(100, n) %>%
  wordcloud2(shape = "diamond", size = 0.3)


Exercise 7.5 When you take a closer look at the resulting word cloud above, you would see the copular verb 是 showing up in the graph, which is counter to our native-speaker intuition. How do you check the instances of these 是 tokens? After you examine these cases, what do you think may be the source of the problem?
Exercise 7.6 To more properly evaluate the quality of the pattern queries, it would be great if we still have the original texts available in the resulting data frame result_bei. How do we keep this information? That is, please have one column in result_bei, which shows the original texts from which the construction token is extracted.

Exercise 7.7 Please use the sample corpus, apple_df as your data source and extract Chinese particle constructions of ... 外/內/中. Usually a space particle construction like these consists of a landmark NP (LM) and the space particle (SP).

For example, in 任期內, 任期 is the landmark NP and 內 is the space particle. In this exercise, we will naively assume that the word directly preceding the space particle is our landmark NP head noun.

Please (a) extract all construction tokens with these space particles and (b) at the same time identify their respective SP and LM, as shown below.

Exercise 7.8 Following Exercise 7.7, please generate a frequency list of the LMs for each space particle. Show us the top 10 LMs of each space particle and arrange the frequencies of the LMs in a descending order, as shown below.

Also, you may visualize the top 10 landmarks that co-occur with each space particle in a bar plot as shown below.


Exercise 7.9 Following Exercise 7.8, for each space particle, please create a word cloud of its co-occurring LMs based on the top 100 LMs of each particle.

PS: The word frequencies in the word clouds shown below are on a log scale.


Exercise 7.10 Based on the word clouds provided in Exercise 7.9, do you find any bizarre cases? Can you tell us why? What would be the problems? Or what did we do wrong in the text preprocessing that may lead to these cases?

Please discuss these issues in relation to the steps in our data processing, i.e., word segmentation, POS tagging, and pattern retrievals.

7.9 Case Study 3: Lexical Bundles

7.9.1 N-grams Extraction

With word boundaries, we can also analyze the recurrent multiword units in Chinese news. Here let’s take a look at the recurrent four-grams in our Chinese corpus.

As the default n-gram tokenization in unnest_tokens() only works with the English data, we start this task by defining our own tokenization functions.

In this section, we define two functions:

  • ngram_chi(): This function takes a word-based vector and returns an ngram-based vector
  • tokenizer_ngrams(): This function takes a vector of texts and returns a list of ngram-based vectors
# Generate ngram sequences from a word vector
# By default, `word_vec` is assumed to be the word tokens of the text
ngram_chi <- function(word_vec,
                      n = 2,
                      delimiter = "_") {
  if (length(word_vec) >= n) {
    map2_chr(
      .x = 1:(length(word_vec) - n + 1),
      .y = n:length(word_vec),
      .f = function(x, y)
        str_c(word_vec[x:y], collapse = delimiter)
    )
  } else{
    return("")
  }#endif
}#endfunc

This ngram_chi() takes a word-based vector as an input, and returns a vector of n-grams.

# test
sents <- c("Jack and Jill went up the hill to fetch a pail of water", 
           "Jack fell down and broke his crown and Jill came tumbling after")
sents_tokens <- str_split(sents, pattern = "\\s+")
sents_tokens
[[1]]
 [1] "Jack"  "and"   "Jill"  "went"  "up"    "the"   "hill"  "to"    "fetch"
[10] "a"     "pail"  "of"    "water"

[[2]]
 [1] "Jack"     "fell"     "down"     "and"      "broke"    "his"     
 [7] "crown"    "and"      "Jill"     "came"     "tumbling" "after"   
map(sents_tokens, ngram_chi, n=3, delimiter = "_")
[[1]]
 [1] "Jack_and_Jill" "and_Jill_went" "Jill_went_up"  "went_up_the"  
 [5] "up_the_hill"   "the_hill_to"   "hill_to_fetch" "to_fetch_a"   
 [9] "fetch_a_pail"  "a_pail_of"     "pail_of_water"

[[2]]
 [1] "Jack_fell_down"      "fell_down_and"       "down_and_broke"     
 [4] "and_broke_his"       "broke_his_crown"     "his_crown_and"      
 [7] "crown_and_Jill"      "and_Jill_came"       "Jill_came_tumbling" 
[10] "came_tumbling_after"

Please note that ngram_chi() will not work properly if it takes a list of word-based vectors as the input. See below:

ngram_chi(sents_tokens)
[1] "c(\"Jack\", \"and\", \"Jill\", \"went\", \"up\", \"the\", \"hill\", \"to\", \"fetch\", \"a\", \"pail\", \"of\", \"water\")_c(\"Jack\", \"fell\", \"down\", \"and\", \"broke\", \"his\", \"crown\", \"and\", \"Jill\", \"came\", \"tumbling\", \"after\")"

If you would like this function to work with a list input, you need to modify the function.
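
One possible modification is sketched below, under the assumption that a list input should simply be processed element by element. The wrapper name ngram_chi_any() is hypothetical (it is not part of any package), and ngram_chi() is repeated only so that the sketch runs on its own.

```r
library(purrr)
library(stringr)

# Copy of `ngram_chi()` from above, repeated so that this sketch is self-contained
ngram_chi <- function(word_vec, n = 2, delimiter = "_") {
  if (length(word_vec) >= n) {
    map2_chr(
      .x = 1:(length(word_vec) - n + 1),
      .y = n:length(word_vec),
      .f = function(x, y) str_c(word_vec[x:y], collapse = delimiter)
    )
  } else {
    ""
  }
}

# Hypothetical wrapper: accepts either one word vector or a list of word vectors
ngram_chi_any <- function(x, n = 2, delimiter = "_") {
  if (is.list(x)) {
    # apply `ngram_chi()` to each word vector in the list
    map(x, ngram_chi, n = n, delimiter = delimiter)
  } else {
    ngram_chi(x, n = n, delimiter = delimiter)
  }
}

toy_list <- list(c("a", "b", "c"), c("d", "e"))
toy_bigrams <- ngram_chi_any(toy_list, n = 2)
toy_bigrams
```

With a list input the wrapper returns a list of n-gram vectors, one per element; with a single word vector it behaves like ngram_chi().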


Before we define the tokenization functions, we first initialize the jiebar object (i.e., the segmenter) with worker().

# initialize the `jiebar` segmenter
segmenter_word <- worker(user = "demo_data/dict-ch-user.txt",
                         bylines = TRUE,
                         symbol = TRUE)

Now we can define our own Chinese ngram tokenization function, tokenizer_ngrams():

# define own tokenizer for ngrams
tokenizer_ngrams <- function(input, jiebar, n, delimiter) {
  input %>% ## corpus texts vector
    segment(jiebar) %>% # segment texts into a list of word vectors
    map(ngram_chi, n, delimiter) # convert each word vector into ngram vectors
}
# examples
texts <- c("這是一個測試的句子",
           "這句子",
           "超短句",
           "最後一個超長的句子測試")

tokenizer_ngrams(
  input = texts,
  jiebar = segmenter_word,
  n = 2,
  delimiter = "_"
)
[[1]]
[1] "這是_一個" "一個_測試" "測試_的"   "的_句子"  

[[2]]
[1] "這_句子"

[[3]]
[1] "超短_句"

[[4]]
[1] "最後_一個" "一個_超長" "超長_的"   "的_句子"   "句子_測試"
tokenizer_ngrams(
  input = texts,
  jiebar = segmenter_word,
  n = 5,
  delimiter = "/"
)
[[1]]
[1] "這是/一個/測試/的/句子"

[[2]]
[1] ""

[[3]]
[1] ""

[[4]]
[1] "最後/一個/超長/的/句子" "一個/超長/的/句子/測試"

With the self-defined ngram tokenizer, we can now perform the ngram tokenization on our Chinese corpus:

  1. We transform the text-based data frame into an n-gram-based data frame using unnest_tokens(...) with the self-defined tokenization function tokenizer_ngrams()

  2. We remove empty and unwanted n-gram entries

    • Empty n-grams due to short texts
    • N-grams spanning punctuation marks, symbols, or paragraph breaks
    • N-grams including alphanumeric characters
## from text-based to ngram-based
system.time(
  apple_df %>%
    unnest_tokens(
      ngram,
      text,
      token = function(x)
        tokenizer_ngrams(
          input = x,
          jiebar = segmenter_word,
          n = 4,
          delimiter = "_"
        )
    ) -> apple_ngram
) ## end system.time
   user  system elapsed 
 29.987   0.183  30.186 
## remove unwanted ngrams
apple_ngram2 <- apple_ngram %>%
  filter(nzchar(ngram)) %>% ## empty strings
  filter(!str_detect(ngram, "[^\u4E00-\u9FFF_]")) ## remove unwanted ngrams

In the above regular expression, the Unicode range [\u4E00-\u9FFF] (the CJK Unified Ideographs block) covers the frequently used Chinese characters. The negated character class [^\u4E00-\u9FFF_] therefore matches any character that is neither in this range nor the delimiter _, and we remove every n-gram that contains at least one such character.
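
The effect of this filter can be illustrated on a few made-up n-grams. This is a minimal sketch: toy_ngrams and toy_kept are hypothetical names, and the strings are invented for illustration.

```r
library(stringr)

# Hypothetical n-grams, using "_" as the delimiter
toy_ngrams <- c("的_句子", "iPhone_手機", "句子_,", "")

# Keep non-empty n-grams that contain only characters in
# U+4E00 to U+9FFF or the delimiter "_"
keep <- nzchar(toy_ngrams) & !str_detect(toy_ngrams, "[^\u4E00-\u9FFF_]")
toy_kept <- toy_ngrams[keep]
toy_kept
# "的_句子"
```

The n-gram with Latin letters, the one with a full-width comma, and the empty string are all dropped.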

For more information related to the Unicode range for punctuation marks in CJK languages, please see this Stack Overflow discussion thread.

7.9.2 Frequency and Dispersion

As we have discussed in Chapter 4, a multiword unit can be defined based on at least two important distributional properties:

  • The frequency of the whole multiword unit (i.e., frequency)
  • The number of texts where the multiword unit is observed (i.e., dispersion)

Now that we have the four-gram-based data frame, we can compute their token frequencies and document frequencies in the corpus using the normal data manipulation tricks.

We set the cut-off for four-grams at dispersion >= 5 (i.e., we keep only the four-grams that occur in at least five different documents).

system.time(
  apple_ngram_dist <- apple_ngram2 %>%
    group_by(ngram) %>%
    summarize(freq = n(), dispersion = n_distinct(doc_id)) %>%
    filter(dispersion >= 5)
) #end system.time
   user  system elapsed 
 44.997   0.177  45.206 
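
The difference between frequency and dispersion is easier to see on a very small data frame. The sketch below uses made-up data (toy_ngram, with the same doc_id and ngram columns as apple_ngram2) and a lower cut-off of dispersion >= 2, chosen only to fit the toy data.

```r
library(dplyr)

# Hypothetical n-gram-based data frame: one row per n-gram token
toy_ngram <- tibble(
  doc_id = c("d1", "d1", "d2", "d3", "d3"),
  ngram  = c("a_b", "a_b", "a_b", "c_d", "c_d")
)

toy_dist <- toy_ngram %>%
  group_by(ngram) %>%
  summarize(freq = n(),                        # token frequency
            dispersion = n_distinct(doc_id)) %>% # number of documents
  filter(dispersion >= 2)                      # toy cut-off (assumption)

toy_dist
```

Here a_b occurs three times in two documents (freq = 3, dispersion = 2) and is kept, whereas c_d occurs twice but in only one document (freq = 2, dispersion = 1) and is filtered out.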

Please take a look at the four-grams, both arranged by frequency and dispersion:

# arrange by dispersion
apple_ngram_dist %>%
  arrange(desc(dispersion)) %>% head(10)
# arrange by freq
apple_ngram_dist %>%
  arrange(desc(freq)) %>% head(10)

We can also look at four-grams with particular lexical words:

apple_ngram_dist %>%
  filter(str_detect(ngram, "被")) %>%
  arrange(desc(dispersion))
apple_ngram_dist %>%
  filter(str_detect(ngram, "以")) %>%
  arrange(desc(dispersion))

Exercise 7.11 In the above example, if we are only interested in the four-grams with the word 以, how can we revise the regular expression so that we can get rid of n-grams with words like 以及, 以上, etc.?

7.10 Afterwords

Figure 7.3: Chinese Word Segmentation and POS Tagging

Tokenization is complex in Chinese text processing. Many factors may need to be taken into account when determining the right tokenization method. While word segmentation is almost a necessary step in Chinese computational text analytics, several important questions may also be relevant to the data processing methods:

  1. Do you need the parts-of-speech tags of words in your research?
  2. What is the base unit you would like to work with? Texts? Paragraphs? Chunks? Sentences? N-grams? Words?
  3. Do you need non-word tokens such as symbols, punctuation marks, numbers, or letters in your analysis?

Your answers to the above questions should help you determine the most effective structure of the tokenization methods for your data.


Exercise 7.12 Please scrape the articles on the most recent 10 index pages of the PTT Gossipping board. Analyze all the articles whose titles start with [問卦], [新聞], or [爆卦] (Please ignore all articles that start with Re:).

Specifically, please create the word frequency list of these target articles by:

  • including only words that are tagged as nouns or verbs by jiebaR (i.e., all words whose POS tags start with n or v)
  • removing words on the stopword list (cf. demo_data/stopwords-ch.txt)
  • providing both the word frequency and dispersions (i.e., number of articles where it occurs)

In addition, please visualize your results with a wordcloud as shown below, showing the recent hot words based on these recently posted target articles on PTT Gossipping.

In the wordcloud, please include words whose (a) nchar() >=2, and (b) dispersion <= 5.

Note: For Chinese word segmentation, you may use the dictionary provided in demo_data/dict-ch-user.txt

  • The target articles from PTT Gossipping:
  • Word Frequency List
  • Wordcloud